1 Introduction

This evaluation covers 7 datasets in total. We evaluated them with the metrics implemented in the OmicsEV package. The sample and class information for each dataset is shown in the table below.

class Array d1 d2 d3 d4 d5 d6
Basal 17 17 17 17 17 17 17
Her2 12 12 12 12 12 12 12
LumA 19 19 19 19 19 19 19
LumB 22 22 22 22 22 22 22
None 16 16 16 16 16 16 16

The detailed sample information is shown below.

sample class batch order
TCGA.A2.A0CM Basal 1 1
TCGA.A2.A0D0 Basal 1 2
TCGA.A2.A0D1 None 1 3
TCGA.A2.A0D2 Basal 1 4
TCGA.A2.A0EQ Her2 1 5
TCGA.A2.A0EV LumA 1 6
TCGA.A2.A0EX LumA 1 7
TCGA.A2.A0EY LumB 1 8
TCGA.A2.A0SW LumB 1 9
TCGA.A2.A0SX Basal 1 10
TCGA.A2.A0T1 Her2 1 11
TCGA.A2.A0T2 Basal 1 12
TCGA.A2.A0T6 LumA 1 13
TCGA.A2.A0T7 LumA 1 14
TCGA.A2.A0YC LumA 1 15
TCGA.A2.A0YD LumA 1 16
TCGA.A2.A0YF LumA 1 17
TCGA.A2.A0YG LumB 1 18
TCGA.A2.A0YI LumA 1 19
TCGA.A2.A0YL LumA 1 20
TCGA.A2.A0YM Basal 1 21
TCGA.A7.A0CD LumA 1 22
TCGA.A7.A0CE Basal 1 23
TCGA.A7.A0CJ LumB 1 24
TCGA.A8.A06N LumB 1 25
TCGA.A8.A06Z LumB 1 26
TCGA.A8.A076 LumB 1 27
TCGA.A8.A079 LumB 1 28
TCGA.A8.A09G Her2 1 29
TCGA.A8.A09I LumB 1 30
TCGA.AN.A04A None 1 31
TCGA.AN.A0AJ LumB 1 32
TCGA.AN.A0AL Basal 1 33
TCGA.AN.A0AM LumB 1 34
TCGA.AN.A0AS LumA 1 35
TCGA.AN.A0FK LumA 1 36
TCGA.AN.A0FL Basal 1 37
TCGA.AO.A03O None 1 38
TCGA.AO.A0J6 None 1 39
TCGA.AO.A0J9 None 1 40
TCGA.AO.A0JC None 1 41
TCGA.AO.A0JE None 1 42
TCGA.AO.A0JJ None 1 43
TCGA.AO.A0JL None 1 44
TCGA.AO.A0JM None 1 45
TCGA.AO.A126 None 1 46
TCGA.AO.A12B None 1 47
TCGA.AO.A12E None 1 48
TCGA.AR.A0TR LumA 1 49
TCGA.AR.A0TT LumB 1 50
TCGA.AR.A0TV LumB 1 51
TCGA.AR.A0TX Her2 1 52
TCGA.AR.A0U4 None 1 53
TCGA.BH.A0EE Her2 1 54
TCGA.BH.A0HP LumA 1 55
TCGA.A2.A0T3 LumB 1 56
TCGA.A7.A13F LumB 1 57
TCGA.AO.A12D None 1 58
TCGA.AO.A12F None 1 59
TCGA.AR.A0TY LumB 1 60
TCGA.AR.A1AQ Basal 1 61
TCGA.AR.A1AV LumA 1 62
TCGA.AR.A1AW LumB 1 63
TCGA.BH.A0AV Basal 1 64
TCGA.BH.A0C1 LumA 1 65
TCGA.BH.A0C7 LumB 1 66
TCGA.BH.A0E9 LumA 1 67
TCGA.C8.A12L Her2 1 68
TCGA.C8.A12P Her2 1 69
TCGA.C8.A12Q Her2 1 70
TCGA.C8.A12T Her2 1 71
TCGA.C8.A12U LumB 1 72
TCGA.C8.A12V Basal 1 73
TCGA.C8.A12W LumB 1 74
TCGA.C8.A12Z Her2 1 75
TCGA.C8.A130 LumB 1 76
TCGA.C8.A131 Basal 1 77
TCGA.C8.A134 Basal 1 78
TCGA.C8.A135 Her2 1 79
TCGA.C8.A138 Her2 1 80
TCGA.D8.A13Y LumB 1 81
TCGA.D8.A142 Basal 1 82
TCGA.E2.A10A LumA 1 83
TCGA.E2.A150 Basal 1 84
TCGA.E2.A154 LumA 1 85
TCGA.E2.A159 Basal 1 86

2 Overview

dataSet # proteins (genes) # proteins (genes) [50%] complex_ks gene_wise_cor sample_wise_cor AUROC func_auc
Array 17814 17814 0.1751827 0.3020985 0.1970433 0.9102871 0.8098193
d1 20501 18694 0.2415586 0.3293080 0.1420492 0.9952153 0.8219823
d2 20501 18717 0.2128561 0.3348777 0.1421854 0.9928230 0.7950738
d3 20501 18694 0.2890523 0.3354781 0.1420492 0.9868421 0.8352795
d4 20501 18694 0.2822335 0.3368288 0.1420492 0.9868421 0.8284495
d5 20501 18694 0.2108862 0.3208170 0.1420492 0.9928230 0.8257509
d6 20501 18694 0.2436248 0.3279982 0.1381387 0.9880383 0.8158570

3 Descriptive

3.1 Protein/gene identification and quantification

The table below shows the number of identified proteins or genes in each dataset. Proteins or genes that remain after filtering at a 50% missing-value threshold are counted as quantified; these counts appear in the column marked [50%].

dataSet # proteins (genes) # proteins (genes) [50%]
Array 17814 17814
d1 20501 18694
d2 20501 18717
d3 20501 18694
d4 20501 18694
d5 20501 18694
d6 20501 18694
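
The missing-value filter behind the [50%] column can be sketched as follows. This is a minimal illustration, not the OmicsEV implementation: the function names, the toy data, and the choice to keep a feature whose missing fraction is exactly 50% are all assumptions.

```python
from typing import Dict, Optional, Sequence


def is_quantified(values: Sequence[Optional[float]], max_missing: float = 0.5) -> bool:
    """Return True if the fraction of missing (None) values is at most max_missing."""
    if not values:
        return False
    missing = sum(1 for v in values if v is None)
    # Assumption: the boundary is inclusive (exactly 50% missing is kept).
    return missing / len(values) <= max_missing


def count_quantified(matrix: Dict[str, Sequence[Optional[float]]],
                     max_missing: float = 0.5) -> int:
    """Count features (rows) that pass the missing-value filter."""
    return sum(is_quantified(row, max_missing) for row in matrix.values())


# Toy data: three hypothetical genes measured in four samples.
toy = {
    "geneA": [1.2, 0.8, None, 1.1],    # 25% missing, kept
    "geneB": [None, None, None, 0.4],  # 75% missing, dropped
    "geneC": [2.0, None, None, 1.9],   # 50% missing, kept under the inclusive rule
}
n_quantified = count_quantified(toy)
```

Under this reading, a dataset with no missing values keeps every identified feature, which matches the Array row, where both columns are 17814.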

The UpSet chart below shows the overlap in proteins or genes identified across datasets. The top bar chart gives the number of identified proteins or genes shared between datasets, and the solid points below it mark which datasets belong to each set. Total identifications for each dataset are shown on the left as ‘Set size’.

3.2 Protein/gene number distribution

The figures below show the number of proteins or genes identified in each sample. The samples from different batches are coded in different shapes and the samples from different classes are coded in different colors.

Panels: Array, d1, d2, d3, d4, d5, d6

4 Data visualization

4.1 Protein or gene expression distribution

The boxplots show the protein or gene expression distribution across samples. The x-axis shows samples in input order; the y-axis shows log2-transformed protein or gene expression. Samples from different classes are shown in different colors.

Panels: Array, d1, d2, d3, d4, d5, d6

The density plots show the protein or gene expression distribution across samples. The x-axis is log2-transformed protein or gene expression; the y-axis is density.

4.2 Batch effect (Heatmap ordered by batches)

In these figures, both rows and columns are samples, and the color of each cell indicates the correlation between the corresponding pair of samples. Samples are ordered by batch.

Panels: Array, d1, d2, d3, d4, d5, d6
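
The matrix behind such a heatmap can be sketched as below. This is an illustration under stated assumptions: the report does not say which correlation coefficient is used, so Pearson is assumed here, and the function names and toy values are hypothetical. Because the sort is stable, samples within a batch keep their input order; in this report all 86 samples are in batch 1, so the ordering would simply be the input order.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def batch_ordered_correlation(expr, batch):
    """Return (sample order, sample-by-sample correlation matrix).

    expr:  dict mapping sample name -> list of expression values
    batch: dict mapping sample name -> batch identifier
    """
    # Stable sort: ties within a batch keep the input order.
    order = sorted(expr, key=lambda s: batch[s])
    matrix = [[pearson(expr[a], expr[b]) for b in order] for a in order]
    return order, matrix


# Toy data: three hypothetical samples, four features each.
expr = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.5],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
batch = {"s1": 2, "s2": 1, "s3": 1}
order, matrix = batch_ordered_correlation(expr, batch)
```

A batch effect would appear in such a matrix as blocks of high correlation along the diagonal that coincide with batch boundaries.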

4.3 Protein or gene coefficient of variation (CV) distribution

Panels: Array, d1, d2, d3, d4, d5, d6

4.4 Missing value distribution

The missing value distribution gives an overview of the percentage of missing values across all proteins or genes in both the QC and experiment samples.

Panels: Array, d1, d2, d3, d4, d5, d6

4.5 Unsupervised analysis of samples: PCA

Panels: Array, d1, d2, d3, d4, d5, d6

4.6 Unsupervised analysis of samples: Cluster analysis

Panels: Array, d1, d2, d3, d4, d5, d6

5 Quantitative evaluation

5.1 Correlation between proteins: within vs between protein complexes

The table below summarizes the evaluation. ‘diff’ is Cor(intra) - Cor(inter), the difference between the IntraComplex and InterComplex columns. ‘ks’ is the statistic of the Kolmogorov-Smirnov test.

dataSet InterComplex IntraComplex diff ks
Array 0.008 0.086 0.078 0.175
d1 0.033 0.164 0.131 0.242
d2 0.003 0.108 0.105 0.213
d3 0.016 0.182 0.166 0.289
d4 0.010 0.171 0.161 0.282
d5 0.062 0.180 0.118 0.211
d6 0.032 0.163 0.131 0.244
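
The two summary columns can be illustrated with a short sketch. It assumes the test is a two-sample Kolmogorov-Smirnov test comparing intra-complex and inter-complex correlation values; the function name and the toy correlation lists are hypothetical, and this is not the OmicsEV code.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy correlation values (hypothetical, not taken from the report).
intra = [0.1, 0.2, 0.3, 0.4]   # protein pairs within the same complex
inter = [-0.1, 0.0, 0.1, 0.2]  # protein pairs from different complexes
ks = ks_statistic(intra, inter)

# 'diff' for the Array row of the table: IntraComplex - InterComplex.
array_diff = round(0.086 - 0.008, 3)
```

A larger ‘ks’ means the intra-complex and inter-complex correlation distributions are further apart, so in the table above d3 (0.289) separates the two groups most strongly and Array (0.175) least.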

5.2 Correlation between mRNA and protein: gene-wise correlation

dataSet n n5 n6 n7 n8 median_cor
Array 8312 1379 504 115 10 0.302
d1 9129 1911 773 210 21 0.329
d2 9131 1989 837 222 22 0.335
d3 9129 1993 823 223 24 0.335
d4 9129 2006 837 225 24 0.337
d5 9129 1764 693 185 20 0.321
d6 9129 1931 763 207 20 0.328

5.3 Correlation between mRNA and protein: sample-wise correlation

5.4 Phenotype prediction

Classes used to build the prediction model: LumA, LumB.

dataSet Variables ROC Sens Spec
Array 17814 0.910 0.789 0.818
d1 18694 0.995 0.947 1.000
d2 18717 0.993 0.947 0.909
d3 18694 0.987 0.947 0.909
d4 18694 0.987 0.947 0.955
d5 18694 0.993 0.947 0.909
d6 18694 0.988 0.895 0.909
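
As a reading aid for the Sens and Spec columns, the sketch below shows how the two quantities are computed from classification counts. The counts are hypothetical: they were back-calculated for illustration on the assumption that LumA (19 samples) is the positive class and LumB (22 samples) the negative class, and are not reported by OmicsEV.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of positive-class samples that were predicted as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of negative-class samples that were predicted as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts consistent with the Array row (Sens 0.789, Spec 0.818):
# 15 of 19 LumA samples and 18 of 22 LumB samples classified correctly.
array_sens = round(sensitivity(true_pos=15, false_neg=4), 3)
array_spec = round(specificity(true_neg=18, false_pos=4), 3)
```

With classes this small, one misclassified sample shifts sensitivity by about 0.05 and specificity by about 0.045, so small differences between datasets in these columns should be read with care.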

5.5 Co-expression network based function prediction

In this evaluation, each dataset was used to build a co-expression network. For a given network and function term (such as a GO or KEGG term), the proteins/genes annotated to the term and present in the network form the positive set, and all other proteins/genes in the network form the negative set. For each term, a subset of its proteins/genes is used as seeds, and a random walk algorithm assigns scores to the remaining proteins/genes. A higher score indicates a closer relationship between a protein/gene and the seeds. Finally, an AUROC is calculated for each selected function term to evaluate prediction performance.

Array d1 d2 d3 d4 d5 d6
Allograft rejection 0.922 0.993 0.975 0.99 0.987 0.991 0.99
Aminoacyl-tRNA biosynthesis 0.663 0.804 0.807 0.759 0.797 0.809 0.816
Antigen processing and presentation 0.867 0.814 0.796 0.835 0.858 0.83 0.862
Asthma 0.921 0.92 0.858 0.957 0.957 0.908 0.906
Autoimmune thyroid disease 0.973 0.964 0.942 0.94 0.938 0.965 0.949
Cell adhesion molecules (CAMs) 0.72 0.81 0.756 0.815 0.786 0.827 0.818
Complement and coagulation cascades 0.75 0.822 0.824 0.895 0.856 0.766 0.814
DNA replication 0 0.89 0.87 0.843 0.876 0.897 0.859
Drug metabolism - other enzymes 0.813 0.595 0.579 0.657 0.577 0.607 0.629
ECM-receptor interaction 0.839 0.876 0.827 0.838 0.838 0.848 0.851
Glycosphingolipid biosynthesis - lacto and neolacto series 0.755 0.847 0.708 0.704 0.701 0.776 0.795
Glycosylphosphatidylinositol(GPI)-anchor biosynthesis 0.82 0.664 0.642 0.672 0.653 0.703 0.72
Graft-versus-host disease 0.935 0.99 0.982 0.99 0.992 0.988 0.99
Homologous recombination 0 0.842 0.755 0.739 0.763 0.8 0.773
Intestinal immune network for IgA production 0.873 0.853 0.859 0.89 0.881 0.833 0.816
Malaria 0.826 0.838 0.828 0.841 0.842 0.835 0.806
Metabolism of xenobiotics by cytochrome P450 0.8 0.647 0.749 0.779 0.79 0.789 0.769
Mismatch repair 0 0.857 0.769 0.778 0.85 0.821 0.837
Oxidative phosphorylation 0.791 0.822 0.757 0.841 0.828 0.819 0.835
Parkinson's disease 0.736 0.811 0.698 0.805 0.785 0.786 0.797
Primary immunodeficiency 0.759 0.782 0.767 0.781 0.786 0.826 0.806
Proteasome 0.895 0.865 0.8 0.892 0.912 0.849 0.895
Protein export 0.807 0.845 0.769 0.876 0.837 0.765 0.839
Retinol metabolism 0.754 0.707 0.812 0.849 0.822 0.878 0.824
Ribosome 0.935 0.943 0.869 0.949 0.932 0.942 0.94
Ribosome biogenesis in eukaryotes 0.81 0.71 0.718 0.746 0.757 0.694 0.723
RNA polymerase 0 0.762 0.709 0.805 0.774 0.794 0.728
RNA transport 0.852 0.746 0.724 0.74 0.745 0.732 0.731
Spliceosome 0.699 0.776 0.772 0.803 0.812 0.772 0.794
Staphylococcus aureus infection 0.874 0.953 0.922 0.94 0.933 0.914 0.951
Steroid hormone biosynthesis 0.607 0.793 0.795 0.831 0.791 0.847 0.791
Systemic lupus erythematosus 0.89 0.828 0.85 0.854 0.854 0.838 0.836
Terpenoid backbone biosynthesis 0.811 0.721 0.725 0.775 0.773 0.698 0.776
Type I diabetes mellitus 0.82 0.885 0.859 0.873 0.889 0.896 0.904
Viral myocarditis 0.782 0.773 0.754 0.829 0.788 0.757 0.795
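
The procedure described at the start of this section can be sketched on a toy network. The sketch assumes a random walk with restart on an unweighted, undirected network; the restart probability, the network, the term membership, and all function names are hypothetical and are not taken from OmicsEV.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visit probability.

    adj:   dict mapping node -> set of neighbouring nodes (undirected)
    seeds: set of nodes where the walk starts and restarts
    """
    nodes = sorted(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart`, jump back to a seed.
        nxt = {n: restart * start[n] for n in nodes}
        # Otherwise, move to a uniformly chosen neighbour.
        for n in nodes:
            if adj[n]:
                share = (1.0 - restart) * prob[n] / len(adj[n])
                for m in adj[n]:
                    nxt[m] += share
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


def auroc(scores, positives):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for n, s in scores.items() if n in positives]
    neg = [s for n, s in scores.items() if n not in positives]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy co-expression network: a-b-c form one module, e-f-g another, d bridges them.
network = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
term_members = {"a", "b", "c"}   # hypothetical function term
seeds = {"a"}                    # subset of the term used as seeds

scores = random_walk_with_restart(network, seeds)
held_out = {n: s for n, s in scores.items() if n not in seeds}
term_auroc = auroc(held_out, term_members - seeds)
```

In this toy case the held-out term members b and c sit next to the seed and outscore every other node, so the AUROC is 1.0. In a real network the term members are spread across modules of varying tightness, which gives intermediate values such as those in the table above.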